Job submission with GPU: Running batch jobs

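To submit a batch job (for example, several commands run together), you need to create a bash script. In general the script needs a shebang line, a set of #SBATCH directives describing the resources requested, and the commands to run.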

[s.1915438@sl1 pytorch_gpu_check]$ cat check_gpu.sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:00:40
#SBATCH --ntasks=1
#SBATCH --job-name=gpu_check
#SBATCH --output=gpu.%j.out
#SBATCH --error=gpu.%j.err
#SBATCH --gres=gpu:1
#SBATCH --account=scw1901
#SBATCH --partition=accel_ai


module load anaconda/3
source activate ml
python gpu.py
[s.1915438@sl1 pytorch_gpu_check]$

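The script is largely self-explanatory: the #SBATCH lines request one node, one task and one GPU (--gres=gpu:1) for at most 40 seconds on the accel_ai partition under account scw1901, and the remaining lines load Anaconda, activate the ml environment and run gpu.py. A more detailed description can be found here: https://www.carc.usc.edu/user-information/user-guides/software-and-programming/anaconda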
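As a rough illustration of what the header of check_gpu.sh contains, the sketch below pulls the #SBATCH options out of the script text and prints them. The helper name parse_sbatch_directives is hypothetical (it is not part of Slurm), and it only understands the long --option=value form used in this script.

```python
import re

# Header of check_gpu.sh, copied from the listing above.
SCRIPT = """#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:00:40
#SBATCH --ntasks=1
#SBATCH --job-name=gpu_check
#SBATCH --output=gpu.%j.out
#SBATCH --error=gpu.%j.err
#SBATCH --gres=gpu:1
#SBATCH --account=scw1901
#SBATCH --partition=accel_ai
"""


def parse_sbatch_directives(script_text):
    """Hypothetical helper: map each '#SBATCH --option=value' line to {option: value}."""
    directives = {}
    for line in script_text.splitlines():
        match = re.match(r"#SBATCH\s+--([\w-]+)=(\S+)", line.strip())
        if match:
            directives[match.group(1)] = match.group(2)
    return directives


directives = parse_sbatch_directives(SCRIPT)
for option, value in directives.items():
    print(f"{option:10s} {value}")
```

Listing the options this way can be a quick sanity check that the GPU request (gres), account and partition are what you intended before submitting.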

The --output and --error options redirect the job's standard output and standard error to files; %j in the file name is replaced by the job ID.
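The %j substitution can be mimicked in a couple of lines. This is only a sketch of the naming rule: expand_job_pattern is a hypothetical helper, the job ID is an example value, and placeholders other than %j are deliberately left untouched.

```python
def expand_job_pattern(pattern, job_id):
    """Hypothetical helper: replace the %j placeholder with the job ID."""
    return pattern.replace("%j", str(job_id))


job_id = 7133028  # example job ID; the scheduler assigns the real one at submission
out_file = expand_job_pattern("gpu.%j.out", job_id)
err_file = expand_job_pattern("gpu.%j.err", job_id)
print(out_file)
print(err_file)
```

Because the job ID is unique per submission, repeated runs of the same script do not overwrite each other's output files.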

The Python file is unchanged.

[s.1915438@sl1 pytorch_gpu_check]$ cat gpu.py
import torch
print(torch.__version__)
print(f"Is available: {torch.cuda.is_available()}")

try:
    print(f"Current Devices: {torch.cuda.current_device()}")
except Exception:
    print('Current Devices: Torch is not compiled for GPU or No GPU')

print(f"No. of GPUs: {torch.cuda.device_count()}")

try:
    print(f"GPU Name:{torch.cuda.get_device_name(0)}")
except Exception:
    print('GPU Name: No GPU available')

Run bash file

Here we use the sbatch command to submit the script.

[s.1915438@sl1 pytorch_gpu_check]$ sbatch check_gpu.sh
Submitted batch job 7133028

The output file is gpu.7133028.out and the error file is gpu.7133028.err.

[s.1915438@sl1 pytorch_gpu_check]$ cat gpu.7133028.out
1.11.0
Is available: True
Current Devices: 0
No. of GPUs: 1
GPU Name:NVIDIA A100-PCIE-40GB
[s.1915438@sl1 pytorch_gpu_check]$ cat gpu.7133028.err
[s.1915438@sl1 pytorch_gpu_check]$
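
If you want to script the last step, the job ID can be read from the message sbatch prints and used to locate the two files. The following is a minimal sketch, assuming the message has the "Submitted batch job <id>" form shown above; job_id_from_sbatch is a hypothetical helper, and the message is hard-coded here rather than captured from a real submission.

```python
import re
from pathlib import Path


def job_id_from_sbatch(message):
    """Hypothetical helper: extract the numeric job ID from sbatch's confirmation line."""
    match = re.search(r"Submitted batch job (\d+)", message)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {message!r}")
    return int(match.group(1))


message = "Submitted batch job 7133028"  # copied from the transcript above
job_id = job_id_from_sbatch(message)

# Report on the files named by --output=gpu.%j.out and --error=gpu.%j.err.
for suffix in ("out", "err"):
    path = Path(f"gpu.{job_id}.{suffix}")
    if path.exists():
        print(f"--- {path} ---")
        print(path.read_text())
    else:
        print(f"{path} not found (the job may still be queued or running)")
```

An empty .err file, as in the transcript above, means nothing was written to standard error.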